Energy Performance Certificates (EPCs) are an essential component of the building sector’s drive towards more sustainable and energy-efficient buildings. EPCs are documents that outline the energy efficiency rating of a building, providing information on its energy usage and carbon emissions. Analyzing the data on EPCs is crucial in understanding the energy performance of buildings and identifying areas where improvements can be made to increase energy efficiency and reduce carbon emissions.
In this context, statistical analysis can provide valuable insights into the energy performance of buildings, helping to identify opportunities for energy savings, and informing decision-making around building design, retrofitting, and operation. This makes data analysis on EPCs a key tool in the transition towards a more sustainable built environment.
In this report we perform high-dimensional clustering on EPC data, first reducing the dimensionality through Principal Component Analysis (PCA), a statistical technique for analyzing high-dimensional data and capturing its most important information. PCA transforms the original data into a lower-dimensional space, combining highly correlated variables into the same components.
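The report is organized as follows: the dataset is described first, followed by outlier detection, principal component analysis, K-means clustering, and finally the characterization of the resulting clusters.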
The dataset used for the analysis is provided by Regione Piemonte, at https://www.dati.piemonte.it/#/catalogodetail/regpie_ckan_ckan2_yucca_sdp_smartdatanet.it_Sicee_v_datigen_energetici_v2_8407.
The dataset contains 633657 rows and 72 columns. Each row holds the certificate data for one building, and the columns hold its attributes, both descriptive and numerical. The numerical variables describe the geometry and the energy performance of the buildings. The meaning of each variable is documented at the link above.
For simplicity, only a subset of these data is considered: certificates submitted since 2019 for buildings of category E1 (households) that have winter and summer air conditioning and hot water production. This reduces the number of certificates analyzed to 26304.
The table below summarizes the dataset:
In this section, outlier detection is performed so that the clustering does not produce clusters containing only a few buildings with unusual geometric or performance values.
Values far from the bulk of each variable's distribution are detected using the interquartile range method, summarized in the following equation:
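\[x < Q_{1} - 5 \cdot IQR \:\lor\: x > Q_{3} + 5 \cdot IQR\] where \(x\) is a value of the variable being checked, \(Q_1\) and \(Q_3\) are the first and third quartiles, and \(IQR = Q_3 - Q_1\) is the interquartile range. A value is treated as an outlier when this condition holds. We use a coefficient of 5, rather than the more common 1.5, so that only very extreme values are eliminated.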
In this case, outlier detection led to the removal of 1079 records, each of which contained an outlier in at least one variable.
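The interquartile rule can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the function name `flag_outliers`, the toy data frame `epc`, and its column names are assumptions, and only the coefficient 5 is taken from the text.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR] for one numeric vector.
# k = 5 matches the coefficient used in the text.
flag_outliers <- function(x, k = 5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy data with hypothetical columns: row 6 has an extreme surface value.
epc <- data.frame(
  surface = c(80, 95, 110, 70, 100, 5000),
  demand  = c(120, 150, 135, 160, 140, 145)
)

# One logical column per variable; a record is dropped if any of its
# variables is flagged.
is_out <- vapply(epc, flag_outliers, logical(nrow(epc)))
epc_clean <- epc[rowSums(is_out) == 0, , drop = FALSE]
```

The rule is applied to each variable separately, so a record is removed as soon as a single attribute falls outside its own bounds.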
After removing the outliers, PCA is carried out to identify the principal components that explain a sufficient share of the variability in the building attributes. The data are first scaled using Z-score standardization, and then PCA is computed with the prcomp function from the stats package.
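A minimal sketch of this step is given here, assuming that standardization is delegated to `prcomp` itself through its `center` and `scale.` arguments (scaling beforehand with `scale()` would be equivalent). The built-in `mtcars` data are used only as a stand-in, since the EPC variables are not reproduced in this report.

```r
# Stand-in numeric data; the actual EPC variables are not used here.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# center = TRUE and scale. = TRUE apply Z-score standardization
# before the components are computed.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Prints the standard deviation, proportion of variance and cumulative
# proportion of each component.
summary(pca)
```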
The summary of the PCA process is shown below:
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2621 1.9916 1.4391 1.11958 0.93728 0.83721 0.77512
## Proportion of Variance 0.3198 0.2479 0.1294 0.07834 0.05491 0.04381 0.03755
## Cumulative Proportion  0.3198 0.5677 0.6972 0.77551 0.83041 0.87422 0.91177
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.62037 0.56516 0.50636 0.42747 0.34583 0.28181 0.18457
## Proportion of Variance 0.02405 0.01996 0.01603 0.01142 0.00748 0.00496 0.00213
## Cumulative Proportion  0.93583 0.95579 0.97181 0.98324 0.99071 0.99567 0.99780
##                           PC15    PC16
## Standard deviation     0.17288 0.07251
## Proportion of Variance 0.00187 0.00033
## Cumulative Proportion  0.99967 1.00000
The first 6 components alone account for about 87% of the variance in the dataset, as shown in the figure below:
To understand which variables contribute most to each principal component, a heat map of the variable contributions to each component is shown.
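One common way to quantify such contributions is the squared loading of each variable, expressed as a percentage. The sketch below assumes this definition, which may differ from the one used for the heat map, and again uses `mtcars` as a stand-in dataset.

```r
# Stand-in numeric data and PCA on standardized variables.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Each column of pca$rotation is a unit-length loading vector, so the
# squared loadings of a component sum to 1 and can be read as shares.
contrib <- 100 * pca$rotation^2

# Percentage contribution of each variable to the first three components.
round(contrib[, 1:3], 1)
```

The resulting matrix can be passed to a plotting function, for instance `heatmap(contrib, Rowv = NA, Colv = NA, scale = "none")` from the stats package.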
To decide how many principal components to keep for the cluster analysis, one of the most widely used methods is to retain all PCs with an eigenvalue greater than 1 (the Kaiser criterion). In standardized data each original variable has unit variance, so an eigenvalue > 1 indicates that the PC accounts for more variance than any single original variable.
In this case, only the first 4 principal components are selected; together they account for 77.6% of the variance in the data.
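The selection can be checked directly from the standard deviations reported in the PCA summary, since the eigenvalue of each component is its squared standard deviation. The variable names below are illustrative only.

```r
# Standard deviations of the 16 components, copied from the PCA summary.
sdev <- c(2.2621, 1.9916, 1.4391, 1.11958, 0.93728, 0.83721, 0.77512,
          0.62037, 0.56516, 0.50636, 0.42747, 0.34583, 0.28181, 0.18457,
          0.17288, 0.07251)

eigenvalues <- sdev^2             # variance carried by each component
n_keep <- sum(eigenvalues > 1)    # components passing the eigenvalue > 1 rule

# Cumulative share of the total variance retained by the kept components.
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)
n_keep
round(cum_var[n_keep], 3)
```

PC4 has an eigenvalue of about 1.25 and PC5 of about 0.88, so the cut falls between the fourth and fifth components.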
Once the principal components have been identified, K-means clustering is applied to group buildings with similar features. The number of clusters is chosen using the Davies-Bouldin index as implemented in the NbClust package.
In this case, 9 clusters have been identified, containing the following number and percentage of EPCs:
The final part of this report interprets the clustering by plotting, for each cluster, the distribution of the variables in the dataset, in order to characterize the results.
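To make the selection criterion concrete, the sketch below computes the Davies-Bouldin index by hand in base R and picks the number of clusters that minimizes it. It is an illustration under stated assumptions and not the NbClust implementation: the function `davies_bouldin`, the simulated `scores` matrix (a stand-in for the retained PC scores), and the candidate range 2 to 6 are all hypothetical.

```r
# Davies-Bouldin index of a partition: lower values indicate more compact,
# better separated clusters.
davies_bouldin <- function(X, cluster, centers) {
  k <- nrow(centers)
  # S[i]: mean Euclidean distance of the points in cluster i to its centroid.
  S <- vapply(seq_len(k), function(i) {
    pts <- X[cluster == i, , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centers[i, ])^2)))
  }, numeric(1))
  M <- as.matrix(dist(centers))   # distances between centroids
  R <- outer(S, S, "+") / M       # pairwise cluster similarity
  diag(R) <- -Inf                 # exclude i == j from the maximum
  mean(apply(R, 1, max))
}

set.seed(1)
# Simulated stand-in for the PC scores: three separated groups, 4 dimensions.
scores <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 6), ncol = 4),
  matrix(rnorm(200, mean = 12), ncol = 4)
)

# Run K-means for each candidate k and keep the k with the lowest index.
ks <- 2:6
db <- vapply(ks, function(k) {
  km <- kmeans(scores, centers = k, nstart = 25)
  davies_bouldin(scores, km$cluster, km$centers)
}, numeric(1))

best_k <- ks[which.min(db)]
```

With the NbClust package installed, a comparable selection would presumably be obtained with `NbClust(scores, min.nc = 2, max.nc = 6, method = "kmeans", index = "db")`, whose `Best.nc` element reports the chosen number of clusters.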